Abstract. In this paper, we study the problem of using the contextual data
points of a data point for its classification. We propose to represent
a data point as the sparse linear reconstruction of its context, and to
learn the sparse context together with a linear classifier in a supervised
way to increase its discriminative ability. We propose a novel
formulation for context learning that models the learning of the context
reconstruction coefficients and the classifier in a unified objective. In
this objective, the reconstruction error is minimized and coefficient
sparsity is encouraged. Moreover, the hinge loss of the classifier is
minimized and the complexity of the classifier is reduced. This objective
is optimized by an alternating strategy in an iterative algorithm.
Experiments on three benchmark data sets show its advantage over
state-of-the-art context-based data representation and classification
methods.
1 Introduction

Pattern classification is a major problem in machine learning research [32, 5,
6, 13]. The two most important topics of pattern classification are data
representation and classifier learning. Zhang et al. proposed an efficient
multi-model classifier for large scale bio-sequence localization prediction
[36]. Zhang et al. developed and optimized association rule mining algorithms
and implemented them on parallel micro-architectural platforms [39, 38]. Most
data representation and classification methods are based on a single data
point: when one data point is considered for representation and
classification, all other data points are ignored. However, the other data
points, which are called contextual data points, may play important roles in
its representation and classification. It is therefore necessary to explore
the contexts of data points when they are represented and/or classified. In
this paper, we investigate the problem of learning an effective
representation of a data point from its context, guided by its class label,
and propose a novel supervised context learning method using sparse
regularization and a linear classifier learning formulation.

We propose a novel method to explore the context of a data point and use it
to represent the data point. We use its k nearest neighbors as its context,
and try to reconstruct it from the data points in its context. The
reconstruction coefficients are constrained to be sparse. Moreover, the
reconstruction result is used as the new representation of this data point.
We apply a linear function to predict its class label from the sparse
reconstruction of its context. The motivation is that, for each data point,
only a few data points in its context are of the same class as itself. To
find the critical contextual data points, we propose to learn the classifier
together with the sparse context. We model this problem as a minimization
problem, in which the context reconstruction error, reconstruction sparsity,
classification error, and classifier complexity are minimized
simultaneously. We also propose a novel iterative algorithm to solve this
minimization problem. We first reformulate it in its Lagrangian form, and
then use an alternating optimization method to solve it.

This paper is organized as follows. In section 2, we introduce the proposed
method. In section 3, we evaluate the proposed method experimentally. In
section 4, we conclude the paper and discuss future work.

2 Proposed method 

We consider a binary classification problem, where a training set of n data
points is given as $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$, where
$\mathbf{x}_i \in \mathbb{R}^{d}$ is the d-dimensional feature vector of the
i-th data point, and $y_i \in \{+1, -1\}$ is the class label of the i-th
point. To learn from the context of the i-th data point, we find its k
nearest neighbors and denote them as $\{\mathbf{x}_{ij}\}_{j=1}^{k}$, where
$\mathbf{x}_{ij}$ is the j-th nearest neighbor of the i-th point. They are
further organized as a d x k matrix
$X_i = [\mathbf{x}_{i1}, \cdots, \mathbf{x}_{ik}] \in \mathbb{R}^{d \times k}$,
where the j-th column is $\mathbf{x}_{ij}$. We represent $\mathbf{x}_i$ by
linearly reconstructing it from its contextual points as

$$\mathbf{x}_i \approx \hat{\mathbf{x}}_i = \sum_{j=1}^{k} \mathbf{x}_{ij} v_{ij} = X_i \mathbf{v}_i \qquad (1)$$

where $\hat{\mathbf{x}}_i$ is its reconstruction, and $v_{ij}$ is the
reconstruction coefficient of the j-th nearest neighbor.
$\mathbf{v}_i = [v_{i1}, \cdots, v_{ik}]^\top \in \mathbb{R}^{k}$ is the
reconstruction coefficient vector of the i-th data point. The reconstruction
coefficient vectors of all the training points are organized in the
reconstruction coefficient matrix
$V = [\mathbf{v}_1, \cdots, \mathbf{v}_n] \in \mathbb{R}^{k \times n}$, with
its i-th column as $\mathbf{v}_i$. To solve for the reconstruction
coefficient vectors, we propose the following minimization problem,


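$$\min_{V} \; \beta \sum_{i=1}^{n} \|\mathbf{x}_i - X_i \mathbf{v}_i\|_2^2 + \gamma \sum_{i=1}^{n} \|\mathbf{v}_i\|_1 \qquad (2)$$

where $\beta$ and $\gamma$ are trade-off parameters. In the objective of this
problem, the first term minimizes the reconstruction error, measured by a
squared $\ell_2$ norm penalty between $\mathbf{x}_i$ and $X_i \mathbf{v}_i$,
and the second term is an $\ell_1$ norm penalty on the contextual
reconstruction coefficient vector $\mathbf{v}_i$.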
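The context construction and the sparse reconstruction of (2) can be sketched in a few lines. This is a minimal illustration, not the solver used in the paper: it assumes Euclidean distance for the nearest-neighbor search, stores data points as columns of a d x n array, and solves each per-point problem with plain iterative soft-thresholding (ISTA). The function names `knn_context`, `sparse_reconstruction`, and `objective` are hypothetical.

```python
import numpy as np

def knn_context(X, i, k):
    """Return the d x k matrix X_i whose columns are the k nearest
    neighbors (Euclidean distance, an assumption) of column i of X."""
    dists = np.linalg.norm(X - X[:, [i]], axis=0)
    dists[i] = np.inf  # a point is not part of its own context
    idx = np.argsort(dists)[:k]
    return X[:, idx]

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def objective(x, Xi, v, beta, gamma):
    """Per-point term of (2): beta*||x - Xi v||_2^2 + gamma*||v||_1."""
    return beta * np.sum((x - Xi @ v) ** 2) + gamma * np.sum(np.abs(v))

def sparse_reconstruction(x, Xi, beta, gamma, n_iter=500):
    """Minimize the per-point term of (2) with ISTA, starting from v = 0."""
    v = np.zeros(Xi.shape[1])
    # Lipschitz constant of the gradient of the smooth (squared error) part.
    L = max(2.0 * beta * np.linalg.norm(Xi, 2) ** 2, 1e-12)
    for _ in range(n_iter):
        grad = -2.0 * beta * Xi.T @ (x - Xi @ v)
        v = soft_threshold(v - grad / L, gamma / L)
    return v
```

With a step size of 1/L the objective never increases from one iteration to the next, and a sufficiently large gamma drives every coefficient to zero, which is the sparsity effect the second term of (2) is meant to produce.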

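We design a linear classifier to classify the i-th data point from its
reconstruction,

$$f(\hat{\mathbf{x}}_i) = \mathbf{w}^\top \hat{\mathbf{x}}_i = \mathbf{w}^\top X_i \mathbf{v}_i \qquad (3)$$

where $\mathbf{w} \in \mathbb{R}^{d}$ is the classifier parameter vector. The
following optimization problem is proposed to learn $\mathbf{w}$.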


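$$\min_{\mathbf{w}, V, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i \qquad (4)$$
$$\text{s.t.} \;\; 1 - y_i \left(\mathbf{w}^\top X_i \mathbf{v}_i\right) \le \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \cdots, n,$$

where $\frac{1}{2}\|\mathbf{w}\|_2^2$ is the squared $\ell_2$ norm
regularization term that reduces the complexity of the classifier, $\xi_i$ is
the slack variable for the hinge loss of the i-th training point,
$\boldsymbol{\xi} = [\xi_1, \cdots, \xi_n]^\top$, and $\alpha$ is a trade-off
parameter.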
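For fixed w and V, the smallest slack that satisfies both constraints of (4) is the hinge loss, xi_i = max(0, 1 - y_i w^T X_i v_i). The sketch below evaluates the classifier response of (3) and the objective of (4) under that substitution. It is only an illustration: the function names and the representation of the contexts as a Python list of d x k arrays are assumptions, not part of the paper.

```python
import numpy as np

def predict(w, Xi, vi):
    """Classifier response f = w^T X_i v_i of equation (3)."""
    return float(w @ (Xi @ vi))

def classifier_objective(w, contexts, V, y, alpha):
    """Objective of (4) with every slack set to its smallest feasible value.

    contexts: list of d x k context matrices X_i (assumed layout)
    V:        k x n matrix whose i-th column is v_i
    y:        array of n labels in {+1, -1}
    """
    scores = np.array([predict(w, Xi, V[:, i]) for i, Xi in enumerate(contexts)])
    xi = np.maximum(0.0, 1.0 - y * scores)  # hinge loss per training point
    return 0.5 * float(w @ w) + alpha * float(xi.sum()), xi
```

A point that is classified correctly with margin at least 1 contributes zero slack, so only margin violations are weighted by alpha.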

The overall optimization problem is obtained by combining the problems in 
both (2) and (4) as 


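$$\min_{\mathbf{w}, V, \boldsymbol{\xi}} \; \frac{1}{2}\|\mathbf{w}\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i + \beta \sum_{i=1}^{n} \|\mathbf{x}_i - X_i \mathbf{v}_i\|_2^2 + \gamma \sum_{i=1}^{n} \|\mathbf{v}_i\|_1 \qquad (5)$$
$$\text{s.t.} \;\; 1 - y_i \left(\mathbf{w}^\top X_i \mathbf{v}_i\right) \le \xi_i, \;\; \xi_i \ge 0, \;\; i = 1, \cdots, n.$$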

According to the dual theory of optimization, the following dual optimization 
problem is obtained. 



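$$\max_{\boldsymbol{\delta}, \boldsymbol{\epsilon}} \min_{\mathbf{w}, V, \boldsymbol{\xi}} \left\{ \begin{aligned} &\frac{1}{2}\|\mathbf{w}\|_2^2 + \alpha \sum_{i=1}^{n} \xi_i + \beta \sum_{i=1}^{n} \|\mathbf{x}_i - X_i \mathbf{v}_i\|_2^2 + \gamma \sum_{i=1}^{n} \|\mathbf{v}_i\|_1 \\ &+ \sum_{i=1}^{n} \delta_i \left(1 - y_i \left(\mathbf{w}^\top X_i \mathbf{v}_i\right) - \xi_i\right) - \sum_{i=1}^{n} \epsilon_i \xi_i \end{aligned} \right\} \qquad (6)$$
$$\text{s.t.} \;\; \boldsymbol{\delta} \ge 0, \;\; \boldsymbol{\epsilon} \ge 0,$$

where $\boldsymbol{\delta} = [\delta_1, \cdots, \delta_n]^\top$ and
$\boldsymbol{\epsilon} = [\epsilon_1, \cdots, \epsilon_n]^\top$ are Lagrange
multipliers. By setting the partial derivatives of H, the objective of (6),
with regard to $\mathbf{w}$ and $\boldsymbol{\xi}$ to zero, we have

$$\mathbf{w} = \sum_{i=1}^{n} \delta_i y_i X_i \mathbf{v}_i, \qquad \alpha - \delta_i - \epsilon_i = 0 \;\Rightarrow\; \alpha \ge \delta_i. \qquad (7)$$

We substitute (7) into (6) to eliminate $\mathbf{w}$ and $\boldsymbol{\xi}$,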


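$$\max_{\boldsymbol{\delta}} \min_{V} \left\{ \begin{aligned} &-\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j \mathbf{v}_i^\top X_i^\top X_j \mathbf{v}_j + \beta \sum_{i=1}^{n} \|\mathbf{x}_i - X_i \mathbf{v}_i\|_2^2 \\ &+ \gamma \sum_{i=1}^{n} \|\mathbf{v}_i\|_1 + \sum_{i=1}^{n} \delta_i \end{aligned} \right\} \qquad (8)$$
$$\text{s.t.} \;\; \boldsymbol{\alpha} \ge \boldsymbol{\delta} \ge 0,$$

where $\boldsymbol{\alpha} = [\alpha, \cdots, \alpha]^\top$ is an
n-dimensional vector with all elements equal to $\alpha$. We solve this
problem with an alternating optimization strategy. In each iteration of an
iterative algorithm, we first fix $\boldsymbol{\delta}$ to solve for V, and
then fix V to solve for $\boldsymbol{\delta}$.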
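The elimination step between (6) and (8) can be written out explicitly. The following is a worked derivation under the definitions above, with H denoting the expression inside the braces of (6); it adds no assumptions beyond the stationarity conditions already stated in (7).

```latex
\begin{align*}
\frac{\partial H}{\partial \mathbf{w}}
  &= \mathbf{w} - \sum_{i=1}^{n} \delta_i y_i X_i \mathbf{v}_i = 0
  &&\Rightarrow\quad \mathbf{w} = \sum_{i=1}^{n} \delta_i y_i X_i \mathbf{v}_i, \\
\frac{\partial H}{\partial \xi_i}
  &= \alpha - \delta_i - \epsilon_i = 0
  &&\Rightarrow\quad \delta_i = \alpha - \epsilon_i \le \alpha
     \quad (\text{since } \epsilon_i \ge 0). \\
\intertext{Substituting the first condition into the terms of $H$ that contain $\mathbf{w}$:}
\frac{1}{2}\|\mathbf{w}\|_2^2
  - \sum_{i=1}^{n} \delta_i y_i \mathbf{w}^\top X_i \mathbf{v}_i
  &= \frac{1}{2}\|\mathbf{w}\|_2^2 - \|\mathbf{w}\|_2^2
   = -\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j
     \mathbf{v}_i^\top X_i^\top X_j \mathbf{v}_j . \\
\intertext{The second condition removes every term that contains $\xi_i$:}
\alpha \xi_i - \delta_i \xi_i - \epsilon_i \xi_i
  &= (\alpha - \delta_i - \epsilon_i)\, \xi_i = 0 .
\end{align*}
```

What remains of H is the quadratic term in delta, the reconstruction and sparsity terms, and the sum of the delta_i, which is the objective of (8). The multiplier epsilon survives only through the box constraint that bounds each delta_i between 0 and alpha.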


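Solving V When $\boldsymbol{\delta}$ is fixed and only V is considered, we
solve for the $\mathbf{v}_i$ one by one, keeping the other coefficient
vectors fixed. Keeping only the terms of (8) that depend on $\mathbf{v}_i$,
the problem is further reduced to

$$\min_{\mathbf{v}_i} \; -\frac{\delta_i^2}{2} \|X_i \mathbf{v}_i\|_2^2 - \delta_i y_i \mathbf{v}_i^\top X_i^\top \sum_{j \ne i} \delta_j y_j X_j \mathbf{v}_j + \beta \|\mathbf{x}_i - X_i \mathbf{v}_i\|_2^2 + \gamma \|\mathbf{v}_i\|_1 \qquad (9)$$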


This problem could be solved efficiently by the modified feature-sign search 
algorithm proposed by Gao et al. [2]. 
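As an illustration of the per-point update, the sketch below collects the terms of (8) that depend on a single v_i and decreases them with proximal gradient steps. This swaps in a generic proximal gradient method for the modified feature-sign search of [2], purely for brevity. The names are hypothetical, and `r` stands for the fixed vector formed by summing delta_j y_j X_j v_j over all j other than i.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def subproblem_objective(v, x, Xi, r, delta_i, y_i, beta, gamma):
    """Terms of (8) that depend on v_i, with every other v_j held fixed."""
    u = Xi @ v
    smooth = (-0.5 * delta_i**2 * (u @ u)
              - delta_i * y_i * (u @ r)
              + beta * np.sum((x - u) ** 2))
    return smooth + gamma * np.sum(np.abs(v))

def update_v(v, x, Xi, r, delta_i, y_i, beta, gamma, n_iter=200):
    """Proximal gradient steps on the v_i subproblem (a stand-in solver)."""
    G = Xi.T @ Xi
    c = 2.0 * beta - delta_i**2  # curvature factor of the smooth part
    # Lipschitz constant of the smooth gradient; guards against division by 0.
    L = max(abs(c) * np.linalg.eigvalsh(G)[-1], 1e-12)
    b = Xi.T @ (delta_i * y_i * r + 2.0 * beta * x)
    for _ in range(n_iter):
        grad = c * (G @ v) - b
        v = soft_threshold(v - grad / L, gamma / L)
    return v
```

With step size 1/L each step does not increase the subproblem objective. The smooth part is convex only when 2*beta is at least delta_i squared, so outside that regime this sketch finds a stationary point at best.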

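Solving $\boldsymbol{\delta}$ When V is fixed and only $\boldsymbol{\delta}$
is considered, the problem in (8) is reduced to

$$\max_{\boldsymbol{\delta}} \; -\frac{1}{2} \sum_{i,j=1}^{n} \delta_i \delta_j y_i y_j \mathbf{v}_i^\top X_i^\top X_j \mathbf{v}_j + \sum_{i=1}^{n} \delta_i$$
$$\text{s.t.} \;\; \boldsymbol{\alpha} \ge \boldsymbol{\delta} \ge 0.$$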

This problem is a typical constrained quadratic programming (QP) problem, 
and it can be solved efficiently by the active set algorithm. 
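For illustration, the box-constrained QP in the multipliers can also be handled by projected gradient ascent, which is used here instead of an active set solver only because it fits in a few lines. The sketch assumes the contexts are given as a list of d x k arrays and V as a k x n array; `solve_delta` is a hypothetical name. It also recovers w from the multipliers through the first condition of (7).

```python
import numpy as np

def solve_delta(contexts, V, y, alpha, n_iter=2000):
    """Maximize -0.5 * d^T Q d + sum(d) subject to 0 <= d <= alpha,
    where Q_ij = y_i y_j (X_i v_i)^T (X_j v_j), by projected gradient ascent."""
    # Columns of Z are the label-signed reconstructions y_i * X_i v_i.
    Z = np.column_stack([y[i] * (Xi @ V[:, i]) for i, Xi in enumerate(contexts)])
    Q = Z.T @ Z  # positive semidefinite, so the objective is concave
    step = 1.0 / max(np.linalg.eigvalsh(Q)[-1], 1e-12)
    delta = np.zeros(len(y))
    for _ in range(n_iter):
        grad = 1.0 - Q @ delta
        # Gradient step followed by projection onto the box [0, alpha]^n.
        delta = np.clip(delta + step * grad, 0.0, alpha)
    w = Z @ delta  # w = sum_i delta_i y_i X_i v_i, as in (7)
    return delta, w
```

In a full alternating scheme this routine would be called after the coefficient vectors have been updated, and the returned multipliers would then be held fixed for the next pass over the v_i.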



3 Experiments 

In this section, we evaluate the proposed supervised sparse context learning 
(SSCL) algorithm on several benchmark data sets. 
3.1 Experiment setup

In the experiments, we used three data sets, which are introduced as follows:

— MANET loss data set: The packet losses of the receiver in mobile ad hoc
networks (MANET) can be classified into three types: losses caused by
wireless random errors, route change losses induced by node mobility, and
network congestion losses. We collected 381 data points for the congestion
loss, 458 for the route change loss, and 516 for the wireless error loss.
Thus there are 1355 data points in total in this data set. To build the
feature vector of each data point, we calculate 12 features from it as in
[1], and concatenate them to form a vector.

— Twitter data set: The second data set is a Twitter data set. The target
of this data set is to predict the gender of a Twitter user, male or female,
given one of his/her Twitter messages. We collected 53,971 Twitter messages
in total, of which 28,012 were sent by male users and 25,959 were sent by
female users. From each Twitter message, we extract term features,
linguistic features, and medium diversity features as gender-specific
features, as in [8].

— Arrhythmia data set: The third data set is publicly available at
http://archive.ics.uci.edu/ml/datasets/Arrhythmia. In this data set, there
are 452 data points, and they belong to 16 different classes. Each data
point has a feature vector of 279 features.

To conduct the experiments, we used 10-fold cross validation.
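The splitting protocol can be sketched as follows. This is a minimal illustration that assumes a random, unstratified partition with a fixed seed, since the text does not specify how the folds were drawn; `ten_fold_indices` is a hypothetical helper.

```python
import numpy as np

def ten_fold_indices(n, n_folds=10, seed=0):
    """Yield (train_idx, test_idx) pairs that partition range(n) into
    n_folds disjoint test folds (random and unstratified, an assumption)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    folds = np.array_split(perm, n_folds)
    for f in range(n_folds):
        test_idx = folds[f]
        train_idx = np.concatenate([folds[g] for g in range(n_folds) if g != f])
        yield train_idx, test_idx
```

Each data point is used for testing exactly once and for training in the other nine folds, so the per-fold accuracies can be summarized as a boxplot of ten values per method.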


3.2 Experimental Results 


Fig. 1. Boxplots of prediction accuracy of different context-based algorithms
on (a) the MANET loss data set, (b) the Twitter data set, and (c) the
Arrhythmia data set.


Since the proposed algorithm is a context-based classification and sparse
representation method, we compared it to three popular context-based
classifiers and one context-based sparse representation method. The three
context-based classifiers are the traditional k-nearest neighbor classifier
(KNN), sparse representation based classification (SRBC) [26], and the
Laplacian support vector machine (LSVM) [11]. The context-based sparse
representation method is Gao et al.'s Laplacian sparse coding (LSC) [3]. The
boxplots of the 10-fold cross validation accuracies of the compared
algorithms are given in figure 1. From the figure, we can see that the
proposed method, SSCL, outperforms all the other methods on all three data
sets. The second best method is SRBC, which also uses a sparse context to
represent the data point. This is strong evidence that learning a supervised
sparse context is critical for the classification problem.



Fig. 2. Parameter sensitivity curves for (a) α, (b) β, and (c) γ.


Sensitivity to parameters In the proposed formulation, there are three
trade-off parameters, α, β, and γ. We plot the curves of mean prediction
accuracy against different values of the parameters in figure 2. From
figures 2(a) and 2(b), we can see that the accuracy is stable with respect
to the parameters α and β. From figure 2(c), we can see that a larger γ
leads to better classification performance.

4 Conclusion and future works 

In this paper, we study the problem of using context to represent and classify
data points. We propose to use a sparse linear combination of the data points
in the context of a data point to represent it. Moreover, to increase the
discriminative ability of the new representation, we develop a supervised
method that learns the sparse context and a classifier together in a unified
optimization framework. Experiments on three benchmark data sets show its
advantage over state-of-the-art context-based data representation and
classification methods. In the future, we will extend the proposed method to
applications in information security [33, 27, 30, 29, 28, 31, 34],
bioinformatics [25, 24, 23, 12, 15, 14, 7, 37], computer vision [16, 17], and
big data analysis using high performance computing [43, 18, 9, 35, 4, 41, 40,
39, 38, 10, 42, 21, 20, 19, 22].